In this paper, I search for the best predictor of feedback type given neural
activity. The data come from a 2019 experiment by Steinmetz et al., which
tracked the neural activity of mice exposed to visual stimuli. The dataset
that I used contained 18 sessions across 4 different mice, along with their
neural data and response types. First, an Exploratory Data Analysis was
conducted to see if there were any relationships that could be uncovered by
exploring certain characteristics of the data: the meta-data, the specific
neuron activation in trials, the average neuron/spike activation over a whole
session, trends of success between mice, and dimensionality reduction plots
with corresponding clustering analysis. After that, a detailed data
integration method was shown that averages the spks counts to form a
dataframe that could be used for predictive modeling. After data integration,
4 models were chosen to be evaluated: Logistic Regression, Random Forest
Classification, XGBoost, and SVM. XGBoost was then chosen to be evaluated on
the testing data as it performed the best. Finally, some additional
corollaries and data questions were discussed pertaining to model selection
and data integrity.
This project analyzes the neural activity in mice and their responses to certain stimuli. The data is a subset of the data collected by Steinmetz et al. (2019). For some background, the experiments were performed on a total of 10 mice over 39 distinct sessions. However, in this subset of the data, we will only be looking at sessions 1 through 18 and at 4 of the 10 mice tested. Each session consisted of hundreds of trials, in which certain stimuli were presented to the mice. These stimuli had different contrast levels ranging from 0 to 1, where 0 meant no stimulus. The mice then made a decision by turning a wheel; if the decision was correct they received a reward, and a penalty otherwise, depending on the relative contrast levels and the direction the wheel was turned.
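The feedback rule can be sketched as a small helper function. The conventions below (which wheel direction counts as correct for each contrast condition, and the random outcome when the two non-zero contrasts are equal) are my reading of Steinmetz et al. (2019), not code from this project:

```r
# Sketch of the feedback rule (my reading of Steinmetz et al. 2019, not project code).
# `turn` is one of "left", "right", or "none"; the return value is 1 (reward) or -1 (penalty).
expected_feedback <- function(contrast_left, contrast_right, turn) {
  if (contrast_left == 0 && contrast_right == 0) {
    if (turn == "none") 1 else -1        # no stimulus: holding the wheel still is correct
  } else if (contrast_left > contrast_right) {
    if (turn == "right") 1 else -1       # higher left contrast: turning right is correct
  } else if (contrast_right > contrast_left) {
    if (turn == "left") 1 else -1        # higher right contrast: turning left is correct
  } else {
    sample(c(1, -1), 1)                  # equal non-zero contrasts: correct side is random (simplified)
  }
}
```

For example, `expected_feedback(0.5, 0, "right")` returns 1, while `expected_feedback(0.5, 0, "left")` returns -1.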
The data collected through these experiments are spike trains, i.e., neuron firings recorded at different time intervals, from the mice's visual cortex.
Throughout this report, I will explore this data and use predictive modeling to predict the feedback type based on features including the neuron spikes and the contrast levels/difference. The structure of the report is as follows:
In this section, I will be exploring some of the data and trying to see if I can point out any trends just based on some plots. In particular, I will start by looking at the data structure (i.e., the number of trials, number of brain areas, etc.). I will then transition into looking at the specific neuron activity in each trial of a specific session, to see if there are any trends in neuron activity over time. I will then look at the average spike counts over an entire session. After that, we will examine whether there are any trends between success rate and different mice. We will then try some dimensionality reduction techniques to visualize the data better, along with some clustering, to see if we can gain more insights.
In this section, I will be explaining how I properly integrated and molded my data so that we can then feed it into a multitude of prediction models. I will also explain how I pared down the spks column in the data to make a tibble to feed into the models.
In this section, I will be talking about how I split the data for model training and testing and the different models I used with their accuracy and other performance metrics.
In this section, I will pick the model that performed best in the previous section and evaluate its performance on a pre-determined test set.
In the last section, I will be going over my results and reflecting on the project. I will be discussing the results and the final model as well as some additional corollaries and other tangents that can be explored in the future in relation to this project and achieving a better model.
A total of 18 RDS files are provided that contain the records from 18
sessions. In each RDS file, you can find the name of the mouse from
mouse_name and the date of the experiment from
date_exp.
## # A tibble: 18 × 6
## mouse_name date_exp num_brain_area num_neurons num_trials success_rate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cori 2016-12-14 8 734 114 0.605
## 2 Cori 2016-12-17 5 1070 251 0.633
## 3 Cori 2016-12-18 11 619 228 0.662
## 4 Forssmann 2017-11-01 11 1769 249 0.667
## 5 Forssmann 2017-11-02 10 1077 254 0.661
## 6 Forssmann 2017-11-04 5 1169 290 0.741
## 7 Forssmann 2017-11-05 8 584 252 0.671
## 8 Hench 2017-06-15 15 1157 250 0.644
## 9 Hench 2017-06-16 12 788 372 0.685
## 10 Hench 2017-06-17 13 1172 447 0.620
## 11 Hench 2017-06-18 6 857 342 0.795
## 12 Lederberg 2017-12-05 12 698 340 0.738
## 13 Lederberg 2017-12-06 15 983 300 0.797
## 14 Lederberg 2017-12-07 10 756 268 0.694
## 15 Lederberg 2017-12-08 8 743 404 0.765
## 16 Lederberg 2017-12-09 6 474 280 0.718
## 17 Lederberg 2017-12-10 6 565 224 0.830
## 18 Lederberg 2017-12-11 10 1090 216 0.806
This is the dataframe that contains some useful information about
each session. It contains the date of the experiment
(date_exp), the number of brain areas activated
(num_brain_area), the number of neurons activated
(num_neurons), the number of trials conducted that day on the given
mouse (num_trials), and the success rate of the mouse that
day (success_rate). This dataframe will be used for the
visualizations in the next 4 plots.
These are some simple plots to understand some metrics about the 4 mice that we will be examining. In the first plot, we examine the average number of brain areas activated for each mouse; Hench has the largest amount. In the next chart, we examine the average number of neurons activated for each mouse, and in this case Forssmann seems to have the highest neuron activation. This is not what I expected: since Hench had the most brain areas activated, we would think that Hench would also have the highest neuron activation, but that was not the case. The next graph examines the average number of trials per mouse, and we see that Hench has the highest number of trials. The last graph looks at the average success rate for each mouse; the mouse with the highest success rate seems to be Lederberg. This is interesting because this mouse was not top in any of the other categories but seems to have the highest average success rate.
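The per-mouse averages behind these four charts come from a simple group-by summary. The sketch below rebuilds a few rows of the session table shown above (the full table has 18 rows) and computes the per-mouse means:

```r
library(dplyr)

# Rebuild a small subset of the session summary table shown above
meta <- tibble::tribble(
  ~mouse_name, ~num_brain_area, ~num_neurons, ~num_trials, ~success_rate,
  "Cori",       8,  734, 114, 0.605,
  "Cori",       5, 1070, 251, 0.633,
  "Lederberg",  6,  565, 224, 0.830,
  "Lederberg", 10, 1090, 216, 0.806
)

# Average every numeric metric per mouse; these per-mouse means drive the four bar charts
mouse_summary <- meta %>%
  group_by(mouse_name) %>%
  summarise(across(where(is.numeric), mean))
```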
Now we will look into specific neuron trends for a subset of trials.
All of these charts represent different trials in Session 5 and neuron activation at different points in time. The colors here represent the different recorded brain areas (ACA, CA1, DG, MOs, OLF, ORB, PL, root, SUB, VISa). If we take a quick look at the graphs, we can see that the predominant neurons that get activated are those in the PL, root, and SUB regions. One of the main takeaways from these types of graphs is that as time goes on in each trial, more neurons start to be activated. This is probably because more of the stimulus is perceived by the eye, and therefore by the visual cortex, so more neurons start to be activated.
Next we will take a step back and look at the spike counts over the whole session.
This chart shows us the average spike counts per trial in Session 5, colored by brain area. I plotted two different lines on this graph: the more jagged line is the true average spike count per trial for each region, and the second is a smoothed version of the line without the noise, which shows the general trend of the average spike count over the trials. One thing I can quickly notice is that the root spike count decreases over the trials and actually approaches the second-highest spike count region, which is the DG. This might point to an overall trend in which the root spike count decreases over time whereas the DG spike count goes up as the mouse learns.
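A plot like this can be built by layering the raw per-trial averages with a loess-smoothed trend. The data below are simulated stand-ins for the real session averages, and `linewidth` replaces the `size` aesthetic that ggplot2 3.4.0 deprecated:

```r
library(ggplot2)

# Simulated per-trial average spike counts for two regions (illustrative only)
set.seed(1)
df <- data.frame(
  trial      = rep(1:50, 2),
  region     = rep(c("root", "DG"), each = 50),
  avg_spikes = c(2.5 - 0.010 * (1:50) + rnorm(50, 0, 0.1),   # root drifting down
                 1.5 + 0.005 * (1:50) + rnorm(50, 0, 0.1))   # DG drifting up
)

# Raw (jagged) line plus a smoothed trend line per region
p <- ggplot(df, aes(trial, avg_spikes, color = region)) +
  geom_line(linewidth = 0.3) +                 # `linewidth`, not the deprecated `size`
  geom_smooth(method = "loess", se = FALSE)
```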
Next I will be trying to understand some homogeneity or heterogeneity across mice.
One of the theories that I postulated was that if more neurons per area were activated, then the success rate would be higher (increased brain activity, leading to increased cognition, which leads to better success?). That is why I created this graph, which shows Success Rate vs. Neurons per Area, divided by the 4 mice. In one way, my theory is somewhat correct: the sessions with the highest Neurons per Area for each mouse also have the highest success rate (except for Cori). This might suggest that Neurons per Area could be a good predictor of success, and therefore that the neural data might be useful for our prediction model.
Another theory I considered was that if a session contained more trials,
then the success rate would be higher, because given more trials, the
mouse would be able to learn the behavior and therefore achieve a higher
success rate. However, as can be seen from this graph, which plots
Success Rate vs. Number of Trials, there is no real trend between the
number of trials and success rate for each of the mice. The only
increasing trend is for the mouse "Forssmann"; however, since it's only 1
occurrence, we can't extend that assumption to any other mouse.
After all this EDA, I wanted to look at all of my data using scatter visualizations. To do this, I decided to use some dimensionality reduction techniques.
## PC1 PC2 PC3 PC4 PC5
## 1 -0.034991747 0.033391316 0.0090505850 -0.009619982 -0.006876164
## 2 0.009811442 0.009313063 0.0136653577 -0.019631488 -0.012868752
## 3 -0.076981644 0.006165135 -0.0337481430 0.031478341 -0.003493235
## 4 0.004084848 -0.054986791 0.0282715145 -0.014108708 -0.026805826
## 5 -0.007527587 -0.018100793 -0.0009297368 0.006049728 -0.009457768
## 6 0.042830738 0.012275521 0.0015375341 0.008337414 0.005018478
## PC6 PC7 PC8 PC9 PC10
## 1 0.011930323 -0.0088875317 0.010407201 0.021960484 -0.005267264
## 2 -0.011298421 -0.0038875006 0.013693426 0.024021617 0.010346435
## 3 0.028901982 0.0006303371 0.010674014 -0.026758594 -0.005669393
## 4 0.004711971 -0.0043468959 -0.015173346 0.004605398 0.001993591
## 5 -0.023208580 0.0193196606 -0.006570966 0.014304741 0.004435307
## 6 0.033298813 0.0279734741 0.005973874 -0.015354850 -0.000595895
## PC11 PC12 PC13 PC14 PC15 PC16
## 1 -0.023493021 0.011158945 -0.003578111 -0.002788000 0.003187075 0.010267968
## 2 0.007186852 0.011734918 0.009362206 0.016094728 -0.008304498 -0.004995782
## 3 0.009041904 0.007154190 0.002914602 -0.017461663 0.002184964 0.005508529
## 4 -0.001933640 -0.005283694 0.002006262 -0.002332455 -0.013184361 0.016428879
## 5 -0.009772983 0.013068152 0.001730950 -0.007914579 0.002005181 -0.005699089
## 6 -0.005785877 0.022944135 -0.002207292 -0.010920518 -0.001682118 0.010625748
## PC17 PC18 PC19 PC20 PC21
## 1 -0.007288295 -0.0085365739 -0.023131260 -0.002726764 -0.019011583
## 2 -0.016570594 -0.0029533836 -0.009207165 -0.005859314 0.008869040
## 3 0.002233175 -0.0032582798 -0.011385060 0.003362349 -0.003587975
## 4 0.002163244 0.0008800762 0.002000396 -0.006742959 -0.007249034
## 5 0.003247336 0.0001700042 -0.006659110 0.019084461 -0.003959087
## 6 -0.009653364 -0.0021256347 0.015682357 0.007616232 0.007863679
## PC22 PC23 PC24 PC25 PC26
## 1 0.0048695544 -0.033966762 -0.001999276 -0.001873079 -0.001797336
## 2 0.0004103514 -0.003402217 -0.013186873 0.014779622 -0.004445181
## 3 0.0021416157 0.008036610 0.004931979 0.003034632 0.017460336
## 4 0.0143299052 0.004037142 0.005279986 0.003320539 0.013689517
## 5 -0.0072091160 0.007395175 0.003912303 0.005433117 -0.006027848
## 6 0.0079255589 0.002753016 0.002773057 0.002714516 -0.003801210
## PC27 PC28 PC29 PC30 PC31
## 1 0.0132447731 -0.010170965 -0.0057499902 -0.0009268524 -0.0069072136
## 2 0.0105654216 0.006934938 0.0135528014 -0.0039533846 0.0057407415
## 3 0.0148427622 0.003047071 0.0023761171 0.0080697063 -0.0036975683
## 4 0.0026088576 -0.009512265 0.0102734165 -0.0086638314 -0.0140352108
## 5 0.0049461683 0.015655411 0.0003388408 -0.0050746283 0.0007145242
## 6 -0.0002696268 0.009374919 0.0021463316 -0.0048363728 0.0007713241
## PC32 PC33 PC34 PC35 PC36
## 1 -0.0052308081 0.015762414 0.010308035 0.0113725032 0.012485870
## 2 -0.0065254387 -0.002233722 0.001980792 0.0009995878 -0.002344987
## 3 -0.0052290031 -0.005762877 -0.001276806 -0.0051955917 0.006235896
## 4 -0.0017191973 -0.019778871 0.002085328 0.0031629513 0.007201138
## 5 0.0166970105 -0.001097291 0.009087502 -0.0170579707 0.002002289
## 6 0.0004909283 0.024412341 0.005211773 -0.0019822978 0.014434747
## PC37 PC38 PC39 PC40 session_id mouse_name
## 1 0.020016413 -0.003342055 -0.002529983 -0.007912523 1 Cori
## 2 -0.001989591 0.006732457 0.005889249 0.002168956 1 Cori
## 3 0.004307757 0.006072202 0.007534858 0.011620755 1 Cori
## 4 0.021108879 -0.005089718 0.005954414 0.003561705 1 Cori
## 5 0.005669963 0.005577454 -0.019247314 0.003694097 1 Cori
## 6 0.002967799 0.002792670 0.006948629 -0.004365701 1 Cori
The first dimensionality reduction technique I decided to use was Principal Component Analysis (PCA). PCA is a linear dimensionality reduction technique that projects points in a high-dimensional space onto a small set of orthogonal directions, called principal components, chosen to maximize variance and reduce information loss. Above is a data frame with a PC representation of the data. Now we will try to visualize the top 2 PCs, labeled by some categorical variables.
This is a 2-D representation of our data. On the x-axis, I graphed the 1st principal component and on the y-axis, the 2nd principal component. I also labeled the points by the session id to see if there was any correlation between the session id and the principal components. Each of the sessions has points that are clustered similarly; however, there doesn't seem to be much correlation between the session and the PCs. Let's see if they are correlated with the mice themselves.
This plot is the same one as the one above, however, the coloring of
points is based on the mouse rather than the session id. In this case I
can’t really notice anything that stands out to me as most of the mice’s
points overlap with each other which means that there might not be any
correlation between the PCs and the mice. I then hypothesized that this
might have something to do with the PCA in general. That is why I
created the plot below.
This is a plot with the number of PCs on the x-axis and the explained
variance on the y-axis. It shows a sharp decline in explained
variance from PC1 to PC2. This means the data might have a
very complex structure that cannot be captured by the
linear combinations of PCA. This is why I decided to try a different
dimensionality reduction technique called t-SNE.
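For reference, the PCA and explained-variance (scree) computation can be sketched with base R's `prcomp()`. The matrix here is simulated in place of the real 40-bin trial averages:

```r
# Simulated stand-in for the 40-bin trial-average matrix (200 trials x 40 bins)
set.seed(2)
X <- matrix(rnorm(200 * 40), nrow = 200, ncol = 40)

# Center and scale, then compute principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC (the scree curve plotted above)
explained_var <- pca$sdev^2 / sum(pca$sdev^2)

# 2-D representation used in the PC scatter plots
pc_scores <- as.data.frame(pca$x[, 1:2])
```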
## V1 V2 session_id mouse_name
## 1 -13.170317 12.8342018 1 Cori
## 2 -4.557124 -16.4390701 1 Cori
## 3 -9.815183 22.6787194 1 Cori
## 4 -8.035183 -2.8186121 1 Cori
## 5 -7.197910 0.5837375 1 Cori
## 6 8.529935 -6.9993757 1 Cori
Above is a data frame from running t-SNE; V1 and V2 represent the two components that are created. t-SNE stands for t-Distributed Stochastic Neighbor Embedding. It is a dimensionality reduction technique, like PCA, but it "pairs high dimensional data points as conditional probabilities. It then minimizes the differences in the conditional probabilities in the higher dimensions and the probabilities in the lower dimensional space". One of the benefits of t-SNE is that it is a non-linear technique, so it might be useful since our dataset is too complex for PCA. Let's examine some of the plots.
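A minimal t-SNE run with the Rtsne package might look like the following; the input here is simulated in place of the real trial-average matrix:

```r
library(Rtsne)

# Simulated stand-in for the 40-bin trial averages
set.seed(3)
X <- matrix(rnorm(300 * 40), nrow = 300)

# dims = 2 gives the two components; perplexity must satisfy 3 * perplexity < n - 1,
# and check_duplicates = FALSE skips the duplicate-row check
tsne_out <- Rtsne(X, dims = 2, perplexity = 30, check_duplicates = FALSE)

# Columns are named V1 and V2, as in the table above
embedding <- as.data.frame(tsne_out$Y)
```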
This is the first plot that I created with t-SNE. On the x-axis we have
the V1 component and on the y-axis we have the V2 component. The color
is also determined by the
session_id. In this case, it
doesn’t look like there is any trend between session_id and
the two components. However, this does seem like a better visualization
of our data.
The plot above is the same as the one created before; however, I have
labeled the points by the mouse_name. This plot doesn't seem to show any
trends, but it shows where some of the points lie with respect to the
mouse_name.
Next I wanted to try some clustering analysis to see if I could gain anything from that.
Here I tried some cluster analysis, specifically k-means clustering. Since our data is of very high dimension, we could not visualize it as is; therefore I clustered the t-SNE version of the data. As you can see above, the clustering doesn't seem to be great (it looks like lines are just drawn at certain points), so I would say that clustering hasn't given us any real value.
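The clustering step can be sketched with base R's `kmeans()` on the 2-D embedding (simulated here); k = 4 is an arbitrary choice for illustration:

```r
# Stand-in for the 2-D t-SNE coordinates
set.seed(4)
embedding <- matrix(rnorm(300 * 2), ncol = 2)

# k-means with several random restarts; km$cluster holds one label per point
km <- kmeans(embedding, centers = 4, nstart = 25)
cluster_labels <- factor(km$cluster)   # factor labels for coloring a scatter plot
```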
After all this EDA, we can now begin our modeling process. However, before beginning, we need to get our data into the proper format. Next I will go over my data integration process.
In this section our goal is to transform the data into the form that we want to then use for our predictive modeling section. Below is some of the code that I used to do that.
binename <- paste0("bin", as.character(1:40))

get_trial_functional_data <- function(session_id, trial_id){
  spikes <- session[[session_id]]$spks[[trial_id]]
  if (any(is.na(spikes))){
    warning("value missing")
  }
  trial_bin_average <- matrix(colMeans(spikes), nrow = 1)
  colnames(trial_bin_average) <- binename
  trial_tibble <- as_tibble(trial_bin_average) %>%
    add_column("trial_id" = trial_id) %>%
    add_column("contrast_left" = session[[session_id]]$contrast_left[trial_id]) %>%
    add_column("contrast_right" = session[[session_id]]$contrast_right[trial_id]) %>%
    add_column("feedback_type" = session[[session_id]]$feedback_type[trial_id])
  trial_tibble
}

get_session_functional_data <- function(session_id){
  n_trial <- length(session[[session_id]]$spks)
  trial_list <- list()
  for (trial_id in 1:n_trial){
    trial_list[[trial_id]] <- get_trial_functional_data(session_id, trial_id)
  }
  session_tibble <- as_tibble(do.call(rbind, trial_list))
  session_tibble <- session_tibble %>%
    add_column("mouse_name" = session[[session_id]]$mouse_name) %>%
    add_column("date_exp" = session[[session_id]]$date_exp) %>%
    add_column("session_id" = session_id)
  session_tibble
}
These are the functions I used to extract the data from the original
files and properly format it. The first function gets the data from each
trial: I first index the spikes, then check for any missing values
(raising a warning if any are found), then calculate the bin averages
using the colMeans() function. I then assemble the rest of
the table by adding the trial_id,
contrast_left, contrast_right, and
feedback_type variables. The second function gathers all
the important data from a session: it loops over every trial, calling
get_trial_functional_data to gather that trial's
information, and then adds all relevant columns to the dataframe,
including the mouse_name, the date_exp, and the
session_id. However, these are just the 2 functions; the
full dataframe is built by the code snippet below.
session_list <- list()
for (session_id in 1:18){
  session_list[[session_id]] <- get_session_functional_data(session_id)
}
full_functional_tibble <- as_tibble(do.call(rbind, session_list))
full_functional_tibble$session_id <- as.factor(full_functional_tibble$session_id)
full_functional_tibble$contrast_diff <- abs(full_functional_tibble$contrast_left - full_functional_tibble$contrast_right)
full_functional_tibble$success <- as.numeric(full_functional_tibble$feedback_type == 1)
head(full_functional_tibble)
## # A tibble: 6 × 49
## bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10 bin11
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191 0.0463
## 2 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341 0.0272
## 3 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422 0.0559
## 4 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354 0.0368
## 5 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163 0.0272
## 6 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191 0.0300
## # ℹ 38 more variables: bin12 <dbl>, bin13 <dbl>, bin14 <dbl>, bin15 <dbl>,
## # bin16 <dbl>, bin17 <dbl>, bin18 <dbl>, bin19 <dbl>, bin20 <dbl>,
## # bin21 <dbl>, bin22 <dbl>, bin23 <dbl>, bin24 <dbl>, bin25 <dbl>,
## # bin26 <dbl>, bin27 <dbl>, bin28 <dbl>, bin29 <dbl>, bin30 <dbl>,
## # bin31 <dbl>, bin32 <dbl>, bin33 <dbl>, bin34 <dbl>, bin35 <dbl>,
## # bin36 <dbl>, bin37 <dbl>, bin38 <dbl>, bin39 <dbl>, bin40 <dbl>,
## # trial_id <int>, contrast_left <dbl>, contrast_right <dbl>, …
In this code chunk, we go through all 18 sessions and call our
get_session_functional_data function, which in turn
calls our get_trial_functional_data function. Once we have
all that data, we make sure our session_id is a categorical factor
by using the as.factor() method. I then added an
additional column for the difference between the left and right
contrasts, as well as a column called success, a binary variable
that is 0 when the mouse is unsuccessful and 1 when it is. Above you
can see what the full_functional_tibble looks like. This is
what I will be using for my modeling.
set.seed(15)
sample <- sample(c(TRUE, FALSE), nrow(full_functional_tibble), replace=TRUE, prob=c(0.8,0.2))
train <- full_functional_tibble[sample, ]
test<- full_functional_tibble[!sample, ]
Next I wanted to prepare my data for model training and testing, so I split it into a training set (80%) and a testing set (20%). NOTE: I will be using another testing set later, so think of this test split as more of a validation set.
train_x <- train %>%
select(-c("mouse_name", "feedback_type", "date_exp", "session_id", "success"))
train_y <- train %>%
select("feedback_type")
test_x <- test %>%
select(-c("mouse_name", "feedback_type", "date_exp", "session_id", "success"))
test_y <- test %>%
select("feedback_type")
For some models, I needed to split the features and the labels. That
is why I created the train_x, train_y,
test_x, and test_y dataframes. In the
feature matrices, I removed the categorical features and the
success and feedback_type variables (these
were going to be my labels). The label matrices contain just
feedback_type, as this was what I was going to predict.
Now that we have all of our data in order, we can finally start on our predictive modeling!
In this section, I will be building 4 different models and assessing
their viability with some performance metrics. The first model that
we will be using is a Logistic Regression model. Despite its name,
this is a binary classification model, and it is probably
the simplest model that we can use to predict the
feedback_type of these mice.
##
## Call:
## glm(formula = feedback_type ~ ., family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4517 -1.1801 0.6080 0.8582 1.8257
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.365e-01 1.302e-01 4.122 3.76e-05 ***
## bin1 -1.411e+01 5.023e+00 -2.808 0.00499 **
## bin2 2.666e+00 5.307e+00 0.502 0.61546
## bin3 -1.236e+01 5.337e+00 -2.316 0.02054 *
## bin4 3.026e+00 5.217e+00 0.580 0.56198
## bin5 -8.447e+00 5.190e+00 -1.628 0.10363
## bin6 -2.838e+00 5.038e+00 -0.563 0.57325
## bin7 -3.219e+00 4.815e+00 -0.668 0.50385
## bin8 3.240e-02 4.923e+00 0.007 0.99475
## bin9 -6.408e+00 4.997e+00 -1.282 0.19973
## bin10 -1.776e+00 4.880e+00 -0.364 0.71582
## bin11 -1.257e+01 5.125e+00 -2.452 0.01419 *
## bin12 -5.450e+00 5.072e+00 -1.075 0.28259
## bin13 3.119e+00 4.896e+00 0.637 0.52410
## bin14 -6.327e+00 5.008e+00 -1.263 0.20643
## bin15 8.631e+00 5.027e+00 1.717 0.08601 .
## bin16 -1.455e+00 5.118e+00 -0.284 0.77613
## bin17 7.086e+00 5.031e+00 1.408 0.15903
## bin18 -1.528e+00 5.192e+00 -0.294 0.76849
## bin19 1.474e+01 5.151e+00 2.861 0.00422 **
## bin20 -5.482e+00 4.944e+00 -1.109 0.26755
## bin21 3.089e+00 4.906e+00 0.630 0.52886
## bin22 3.222e+00 4.972e+00 0.648 0.51697
## bin23 5.275e+00 5.092e+00 1.036 0.30026
## bin24 1.921e+00 4.984e+00 0.385 0.69997
## bin25 2.043e+00 4.999e+00 0.409 0.68271
## bin26 9.137e+00 5.092e+00 1.794 0.07278 .
## bin27 -4.177e+00 5.058e+00 -0.826 0.40893
## bin28 7.508e+00 5.056e+00 1.485 0.13758
## bin29 4.264e+00 5.030e+00 0.848 0.39665
## bin30 2.967e+00 5.154e+00 0.576 0.56484
## bin31 2.909e+00 5.247e+00 0.554 0.57931
## bin32 4.460e+00 5.100e+00 0.875 0.38181
## bin33 4.381e+00 5.137e+00 0.853 0.39377
## bin34 -5.093e+00 5.136e+00 -0.992 0.32143
## bin35 9.526e+00 5.229e+00 1.822 0.06846 .
## bin36 -1.045e+00 5.183e+00 -0.202 0.84021
## bin37 -5.438e-01 5.135e+00 -0.106 0.91565
## bin38 1.556e+01 5.099e+00 3.052 0.00228 **
## bin39 -2.100e+00 5.056e+00 -0.415 0.67796
## bin40 -8.071e+00 4.865e+00 -1.659 0.09712 .
## trial_id -2.413e-03 3.827e-04 -6.305 2.88e-10 ***
## contrast_left -4.890e-01 1.119e-01 -4.371 1.24e-05 ***
## contrast_right -6.940e-01 1.175e-01 -5.909 3.45e-09 ***
## contrast_diff 1.295e+00 1.336e-01 9.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4849.4 on 4033 degrees of freedom
## Residual deviance: 4438.7 on 3989 degrees of freedom
## AIC: 4528.7
##
## Number of Fisher Scoring iterations: 4
## [1] "Accuracy: 0.71251193887297"
Above we can see our model fit with all the predictors. The stars
represent the most significant features for predicting
feedback_type. Our accuracy for this model is around 71%,
which seems pretty good for a very simple model like logistic
regression. Let's take a look at the confusion matrix for this model
and see where it was not performing so well.
## Var1 predicted_class Freq
## 1 -1 -1 39
## 2 1 -1 32
## 3 -1 1 269
## 4 1 1 707
This confusion matrix shows us how many values were misclassified.
The model did pretty well at classifying positive values; however, it
did not perform well when classifying negative (failure) values. This
is an interesting observation: perhaps negative feedback is more
difficult to predict because its neural activity is too similar to
that of positive feedback. Let's see if we can improve on that.
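Since the fitting code is not shown above, here is a sketch of how a logistic regression fit, its accuracy, and the confusion matrix could be produced. The data below are a simulated stand-in: the real model uses feedback_type and the full set of bin features:

```r
set.seed(5)

# Toy stand-in for the training data: two bin features plus the contrast difference
n <- 500
df <- data.frame(bin1 = rnorm(n), bin2 = rnorm(n), contrast_diff = runif(n))
# Success label: 1 when the (noisy) contrast difference is large, else 0
df$label <- as.numeric(df$contrast_diff + rnorm(n, 0, 0.3) > 0.5)

# family = "binomial" makes glm() fit a logistic regression on the 0/1 response
fit <- glm(label ~ ., data = df, family = "binomial")

# Classify at a 0.5 probability threshold, then compute accuracy and the confusion matrix
pred <- as.numeric(predict(fit, type = "response") > 0.5)
accuracy <- mean(pred == df$label)
conf_matrix <- table(actual = df$label, predicted_class = pred)
```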
## [1] "Accuracy: 0.71251193887297"
The next model that I tried was a Random Forest. This is a tree-based
algorithm that creates n trees and employs an ensemble method called
bagging to help improve accuracy. Bagging works by creating many weak
learners (decision trees), each trained on a different bootstrap
subset of the dataset. Each tree then outputs a response, and the
final response is calculated by a method like majority voting (i.e.,
pick the most common answer). In this case our Random Forest model
gave us the same accuracy as our logistic regression model, around
71%. One thing we can also use the Random Forest model for is
determining feature importances, which can help us understand which
features contribute most to the prediction. Based on the diagram
above, trial_id and contrast_diff seem to have the highest
importance. In terms of model interpretation, we might think that
these two features are highly correlated with
feedback_type; however, we would need to conduct additional
analysis to be sure of that.
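A random forest fit with feature importances can be sketched as follows (simulated data; a factor response makes randomForest() do classification with majority voting):

```r
library(randomForest)

set.seed(6)
# Simulated stand-in for the training data
n <- 300
df <- data.frame(bin1 = rnorm(n), bin2 = rnorm(n),
                 trial_id = 1:n, contrast_diff = runif(n))
df$feedback_type <- factor(ifelse(df$contrast_diff + rnorm(n, 0, 0.3) > 0.5, 1, -1))

# ntree bagged trees; importance = TRUE records per-feature importance measures
rf <- randomForest(feedback_type ~ ., data = df, ntree = 200, importance = TRUE)

# Importance matrix (one row per predictor) that underlies the importance plot
imp <- importance(rf)
```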
Next I will be trying to use another ensemble tree method that is popular with this data: XGBoost.
## [1] train-logloss:0.601639
## [2] train-logloss:0.543854
## [3] train-logloss:0.499014
## [4] train-logloss:0.465811
## [5] train-logloss:0.439527
## [6] train-logloss:0.424953
## [7] train-logloss:0.410698
## [8] train-logloss:0.399787
## [9] train-logloss:0.390597
## [10] train-logloss:0.383053
## [1] "Accuracy: 0.73352435530086"
The next model that I used was XGBoost. This is a very popular gradient boosting algorithm (another type of ensemble method) that works particularly well with tabular data. Looking at the accuracy, we achieved a somewhat higher accuracy of around 73%. Seems like XGBoost does work well after all! Next I wanted to look at why XGBoost was particularly good on our data, so I constructed a confusion matrix to look at the values.
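An XGBoost fit like the one above can be sketched as follows; the matrix and 0/1 labels here are simulated stand-ins for the real features and feedback labels (xgboost wants 0/1, not -1/1):

```r
library(xgboost)

set.seed(7)
# Simulated stand-in for the training matrix and labels
n <- 400
X <- matrix(rnorm(n * 5), ncol = 5)
y <- as.numeric(X[, 1] + rnorm(n, 0, 0.5) > 0)

dtrain <- xgb.DMatrix(data = X, label = y)

# 10 boosting rounds, matching the 10 train-logloss lines printed above
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 10)

# Predictions are probabilities; threshold at 0.5 for class labels
pred <- as.numeric(predict(bst, dtrain) > 0.5)
accuracy <- mean(pred == y)
```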
Here is the confusion matrix for our XGBoost model. It seems like
something that it did better than our Logistic Regression model was
classifying the negative feedback. Even though it didn’t do the best
job, it is nice to see that there was an improvement with our model.
In the last model, I want to try a model that might be good at classifying this high dimensional data: a SVM.
## [1] "Accuracy: 0.72683858643744"
The last model that I used was a Support Vector Machine (SVM). Support vector machines use a kernel function to help classify high-dimensional data. Since our data is high-dimensional, I believed that an SVM would be a good fit. As you can see, our accuracy was pretty good at around 72%. While this was not quite as good as our XGBoost model, it performed well on the data and was better than our Random Forest and Logistic Regression models.
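An SVM fit with the e1071 package can be sketched as follows (simulated two-feature data in place of the real 40+ features; the radial kernel is e1071's default for non-linear boundaries):

```r
library(e1071)

set.seed(8)
# Simulated stand-in for the training data with a -1/1 factor response
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(n, 0, 0.5) > 0, 1, -1))

# Radial (RBF) kernel SVM; a factor response triggers classification
model <- svm(y ~ ., data = df, kernel = "radial")

pred <- predict(model, df)
accuracy <- mean(pred == df$y)
```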
Since the SVM creates a decision boundary in a high-dimensional space, I wanted to visualize its results on the testing dataset. Because we can't see all 40+ dimensions of the dataset, I used a dimensionality reduction technique to create the graph below.
The plot above shows the classification of the test data. I visualized
this data by using PCA to bring the data down to 2 dimensions and
plotted the predicted classification (either -1 or 1). One trend I
noticed is that all the data points seem to lie on 4 lines/bands,
which might indicate some homogeneity within the test set. In
addition, most of the predicted negative feedback points had a PC1
value of around 100 or greater. We also see that the test set contains
a lot more positive feedback values than negative ones, which might be
of concern for us.
After all this model building and evaluation, I will choose the XGBoost model as it performed the best on the validation split. Next we will evaluate this model on the test data.
testing=list()
for(i in 1:2){
file_path <- paste('/Users/tejgaonkar/Downloads/test/test', i, '.rds', sep='')
testing[[i]] <- readRDS(file_path)
print(testing[[i]]$mouse_name)
print(testing[[i]]$date_exp)
}
## [1] "Cori"
## [1] "2016-12-14"
## [1] "Lederberg"
## [1] "2017-12-11"
testing_list = list()
for (testing_id in 1:2) {
  testing_list[[testing_id]] <- get_session_functional_data(testing_id)
}
full_functional_tibble_test <- as_tibble(do.call(rbind, testing_list))
full_functional_tibble_test$session_id <- as.factor(full_functional_tibble_test$session_id)
full_functional_tibble_test$contrast_diff <- abs(full_functional_tibble_test$contrast_left-full_functional_tibble_test$contrast_right)
full_functional_tibble_test$success <- full_functional_tibble_test$feedback_type == 1
full_functional_tibble_test$success <- as.numeric(full_functional_tibble_test$success)
head(full_functional_tibble_test)
## # A tibble: 6 × 49
## bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10 bin11
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191 0.0463
## 2 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341 0.0272
## 3 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422 0.0559
## 4 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354 0.0368
## 5 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163 0.0272
## 6 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191 0.0300
## # ℹ 38 more variables: bin12 <dbl>, bin13 <dbl>, bin14 <dbl>, bin15 <dbl>,
## # bin16 <dbl>, bin17 <dbl>, bin18 <dbl>, bin19 <dbl>, bin20 <dbl>,
## # bin21 <dbl>, bin22 <dbl>, bin23 <dbl>, bin24 <dbl>, bin25 <dbl>,
## # bin26 <dbl>, bin27 <dbl>, bin28 <dbl>, bin29 <dbl>, bin30 <dbl>,
## # bin31 <dbl>, bin32 <dbl>, bin33 <dbl>, bin34 <dbl>, bin35 <dbl>,
## # bin36 <dbl>, bin37 <dbl>, bin38 <dbl>, bin39 <dbl>, bin40 <dbl>,
## # trial_id <int>, contrast_left <dbl>, contrast_right <dbl>, …
testing_x <- full_functional_tibble_test %>%
  select(-c("mouse_name", "date_exp", "session_id", "success", "feedback_type"))
testing_y <- full_functional_tibble_test %>%
  select("feedback_type")
To load the testing data, I used the same process as before: loading the .RDS files and passing them through the same functions that I had defined in the data-integration part of the report. This produced the tibble shown above. For my XGBoost model, I needed to divide the data into a testing feature matrix with all the relevant features and a corresponding vector of labels; these are the two subsets of the dataframe that I created above.
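For reference, converting a feature matrix and -1/1 labels into the `xgb.DMatrix` format that XGBoost expects might look like the sketch below, shown on synthetic stand-ins (it assumes, as is typical for `binary:logistic`, that the model was trained with 0/1 labels):

```r
# Sketch: building XGBoost's matrix format from features + labels.
library(xgboost)

set.seed(7)
features_demo <- matrix(runif(50 * 45), ncol = 45)   # stand-in for testing_x
labels_demo   <- sample(c(-1, 1), 50, replace = TRUE) # stand-in for testing_y

# Map labels from {-1, 1} to {0, 1} for binary classification
dtest <- xgb.DMatrix(data = features_demo,
                     label = as.numeric(labels_demo == 1))
dim(dtest)
```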
Next I moved on to evaluating the model.
## [1] "Accuracy: 0.786301369863014"
The first evaluation metric I used was accuracy, and in this case the model performed fairly well at around 78% accuracy. This is better than the 73% accuracy on the validation set, although the difference could simply reflect the particular composition of the test data. Regardless, it seems the model performed well on the testing data.
Next, I wanted to plot an ROC curve and calculate the AUC to evaluate model performance.
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Setting levels: control = -1, case = 1
## Setting direction: controls < cases
## AUC: 0.8612178
In this case, I looked at the ROC curve (Receiver Operating Characteristic curve) and the AUC metric (Area Under the ROC Curve). The ROC curve plots the false positive rate on the x-axis against the true positive rate on the y-axis. The diagonal line represents a random classifier, one that assigns the two classes at random, while a perfect classifier's curve hugs the top-left corner. An easier metric to summarize this is the AUC, which measures the area under the ROC curve: an AUC of 0.5 corresponds to an essentially random classifier and 1 to a perfect one, with values between 0.5 and 1 indicating increasing effectiveness. Our AUC value for the test set was 0.86, which is very good. AUC is a particularly suitable metric here because our testing data is slightly imbalanced; unlike raw accuracy, it is not inflated by the majority class and gives a fairer picture of how well the model separates the two classes.
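The ROC/AUC computation with pROC can be sketched like this on synthetic labels and scores (the report uses the real test labels and the XGBoost predicted probabilities):

```r
# Sketch: ROC curve and AUC with pROC on synthetic data.
library(pROC)

set.seed(1)
labels <- sample(c(-1, 1), 300, replace = TRUE, prob = c(0.4, 0.6))
scores <- runif(300) + 0.3 * (labels == 1)  # scores mildly correlated with class

roc_obj <- roc(response = labels, predictor = scores,
               levels = c(-1, 1), direction = "<")
paste("AUC:", round(auc(roc_obj), 4))
plot(roc_obj)  # diagonal = random classifier; top-left corner = perfect
```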
As always, we want to see how our model classified the different categories, so I made a confusion matrix below.
This confusion matrix shows the XGBoost classifications on the testing data. The model does a much better job at classifying the positive feedback than the negative feedback. This might be due to data imbalance, as there were more positive-feedback values (228) than negative-feedback values (137). This is why I think AUC is a much better metric in this case; still, it is useful to visualize the raw performance of our model using a confusion matrix.
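A confusion matrix is just a cross-tabulation of predicted versus actual labels. Here is a minimal base-R sketch on synthetic stand-ins (the class counts mirror those noted above; the predictions are randomly generated for illustration):

```r
# Sketch: building a confusion matrix with table().
set.seed(99)
actual    <- factor(c(rep(1, 228), rep(-1, 137)))  # 228 positive, 137 negative
predicted <- factor(sample(c(-1, 1), 365, replace = TRUE, prob = c(0.25, 0.75)),
                    levels = levels(actual))
table(Predicted = predicted, Actual = actual)
```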
That concludes the testing section of the report. We will now move on to our final section: the Discussion.
In this report, we tried to build the best predictor of feedback type in mice based on their neural activities. I started with some Exploratory Data Analysis, trying to understand the underlying trends and features in the data by looking at the meta-data, the specific neural data, and the spikes across sessions, comparing trends between mice, and finally performing dimensionality reduction and clustering analysis. I then explained how I integrated the data by averaging out the spike counts into a dataframe that could be fed into a predictor, and how I split that data into training and testing (really validation) sets. Next, I discussed the 4 different models I tried: Logistic Regression, Random Forest Classification, XGBoost, and SVM, along with their accuracies and the other metrics I used to evaluate their performance. Finally, I chose the XGBoost model as it had the best accuracy, and evaluated it on the test set by first integrating the test set using the data-integration method I described, and then assessing it with accuracy, AUC (with an ROC curve), and a confusion matrix.
Throughout this project, the biggest issue was dealing with the awkward structure of the data. First of all, the data could not be integrated into a dataframe/tibble without using some sort of aggregate function. Even with the approach that I took, I definitely lost some depth of information, which means there might be other approaches that keep the data more intact and therefore achieve better performance. In addition, I noticed that all 3 subsets of the original data had more examples of positive feedback than negative feedback. Whenever I looked at a confusion matrix, the model was fairly good at classifying positive feedback but very bad at negative feedback. With more instances, or roughly equal numbers of both classes, our models might have been more accurate, as they would be better able to learn the true relationship between the neural activities and the feedback_type.
If I were to take this project further, another thing I would like to try is a deep learning model (i.e., a feed-forward neural network). I initially tried one; however, it took a very long time to run, so I dropped it in the end. Still, I think it would be interesting to see how a neural network performs on this data and whether it would give us good accuracy. Since all the features are already numerical, little to no encoding of categorical variables would be needed, so in theory a neural network should be fairly easy to implement, barring algorithmic and computational constraints.
Overall, throughout my report, I found that the XGBoost model was the best at predicting feedback_type given the neural activities, trial_id, contrast_left, contrast_right, and contrast_diff. Our model achieved 73% accuracy on the validation set and 78% accuracy on the testing set, with a 0.86 AUC score. This matched my initial expectation, because XGBoost is known to be a consistently strong learning algorithm on tabular data and has been very reliable in practice. Although this task is not as easy as it might first seem, XGBoost does a good job of modeling this data and producing reliable predictions.